[Large Tensor] Fixed Spatial Transformer op #17617

connorgoggins · 2020-02-17T23:30:44Z

Description

The Spatial Transformer op was previously breaking on large tensor (dimension >= 2^32) data. With the following input:

run_performance_test(nd.SpatialTransformer, run_backward=True, inputs=[{'data': (2, 2**29,1,6), 'loc': nd.random_normal(shape=(2,6)), 'transform_type': 'affine', 'sampler_type': 'bilinear', 'target_shape': (2,6)}], warmup=1, runs=1)

the following error was thrown:

*** Error in `python3': double free or corruption (out): 0x00007f36bbffe010 ***

To root cause this issue, I ran the previous command in a Python script with GDB, and found that the underlying problem was in the iteration portion of the forward and backward methods of spatial_transformer.cc. Several of the variables used in the iteration used the int dtype when they should have been using index_t to properly handle long int indices. I switched these variables to index_t in the forward and backward methods, and after rebuilding, the previous input command displayed the correct output:

INFO:root:Begin Benchmark - SpatialTransformer
INFO:root:Complete Benchmark - SpatialTransformer
[{'SpatialTransformer': [{'inputs': {'data': (2, 536870912, 1, 6), 'loc': '<NDArray 2x6 @cpu(0)>', 'transform_type': 'affine', 'sampler_type': 'bilinear', 't
arget_shape': (2, 6)}, 'max_storage_mem_alloc_cpu/0': 102005472.0, 'avg_time_forward_SpatialTransformer': 551614.125, 'avg_time_backward_SpatialTransformer':
 511034.0938}]}]

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

M src/operator/tensor/spatial_transformer.cc

Comments

Tested on r5dn.24xl-ubuntu 16.04 and p2.16xl-ubuntu 16.04 with

Individual op run
Full OpPerf run

Results

The key difference between CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remain the same, and both were tested using CPU context.

Single operator test - SpatialTransformer op (GPU)
Single operator test - SpatialTransformer op (CPU)

Full OpPerf test (GPU)
Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

connorgoggins · 2020-02-17T23:31:52Z

@mxnet-label-bot add [pr-awaiting-review]

access2rohit · 2020-02-17T23:59:39Z

@connorgoggins can you tell me what are the possible types for DType here:
https://github.com/apache/incubator-mxnet/pull/17617/files#diff-64979618e95b2211ff9317b0830fbbd2R37

connorgoggins · 2020-02-18T00:47:24Z

@access2rohit DType here represents the type of the interior elements of the input tensors, and since SpatialTransformer only supports floating point types then DType could be float32 or float64 (also potentially float16, but this DType doesn't seem to have been implemented for GEMM on CPU yet).

ChaiBapchya · 2020-02-19T02:17:37Z

src/operator/spatial_transformer.cc

-          index_t top_right_v = 0;
-          index_t bottom_left_v = 0;
-          index_t bottom_right_v = 0;
+          DType top_left_v = 0;


why is this changed back to Dtype?
It's the index position right?

@ChaiBapchya changing DType to index_t for these specific variables makes the op generate incorrect output on standard inputs (e.g. the inputs in the CI run) - the values generated in the output NDArray are all integers instead of floats. This is due to the fact that these variables do not represent the index positions (as I also originally believed), but instead represent the underlying values at the vertices.

Cool. good catch.
_v indicates the value at that particular index. Thanks!

ChaiBapchya

LGTM!

* Added CPU fix * Added fix for backward on CPU * Fixed lint error * index_t for lower bound instead of hardcoded long int * Fixed remaining lint errors * Removed trailing whitespace * Reverting to DType for vertices * Added nightly test for SpatialTransformer

connorgoggins changed the title ~~[LT] Fixed Spatial Transformer op~~ [Large Tensor] Fixed Spatial Transformer op Feb 17, 2020

lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 17, 2020

connorgoggins force-pushed the fix_spatial_transformer_large_tensor branch from c5ff973 to 871849e Compare February 18, 2020 21:10

ChaiBapchya reviewed Feb 19, 2020

View reviewed changes

ChaiBapchya approved these changes Feb 19, 2020

View reviewed changes

connorgoggins force-pushed the fix_spatial_transformer_large_tensor branch 3 times, most recently from d3696f2 to 50d1530 Compare February 24, 2020 22:55

connorgoggins added 8 commits February 25, 2020 02:01

Added CPU fix

b759cc3

Added fix for backward on CPU

1f795ac

Fixed lint error

195c801

index_t for lower bound instead of hardcoded long int

f8f7338

Fixed remaining lint errors

3be467d

Removed trailing whitespace

1fc5f7f

Reverting to DType for vertices

3fcfd6a

Added nightly test for SpatialTransformer

5ac85a6

connorgoggins force-pushed the fix_spatial_transformer_large_tensor branch from 50d1530 to 5ac85a6 Compare February 25, 2020 10:02

apeforest approved these changes Feb 25, 2020

View reviewed changes

apeforest merged commit 13b3893 into apache:master Feb 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Large Tensor] Fixed Spatial Transformer op #17617

[Large Tensor] Fixed Spatial Transformer op #17617

connorgoggins commented Feb 17, 2020 •

edited

Loading

connorgoggins commented Feb 17, 2020

access2rohit commented Feb 17, 2020

connorgoggins commented Feb 18, 2020

ChaiBapchya Feb 19, 2020

connorgoggins Feb 19, 2020 •

edited

Loading

ChaiBapchya Feb 19, 2020

ChaiBapchya left a comment

[Large Tensor] Fixed Spatial Transformer op #17617

[Large Tensor] Fixed Spatial Transformer op #17617

Conversation

connorgoggins commented Feb 17, 2020 • edited Loading

Description

Checklist

Essentials

Changes

Comments

Results

connorgoggins commented Feb 17, 2020

access2rohit commented Feb 17, 2020

connorgoggins commented Feb 18, 2020

ChaiBapchya Feb 19, 2020

Choose a reason for hiding this comment

connorgoggins Feb 19, 2020 • edited Loading

Choose a reason for hiding this comment

ChaiBapchya Feb 19, 2020

Choose a reason for hiding this comment

ChaiBapchya left a comment

Choose a reason for hiding this comment

connorgoggins commented Feb 17, 2020 •

edited

Loading

connorgoggins Feb 19, 2020 •

edited

Loading